Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Warning: stack imbalance in 'lazyLoadDBfetch', 20 then 18
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching SeuratObject
We would like to include two datasets: TO_1, which is a erebral organoid with H841 cells, and TO_2, which is H841 cells growing in 2D for a comparison. We will download both and save as Seurat objects, but in this notebook we’ll focus on TO_1. In another notebook, we’ll integrate the two datasets so we can directly compare them.
To use this code, change the directory indir below.
I am using Seurat to read in the data, and since the format is similar to 10X (a folder with matrix, barcodes, and features), I’m using the Read10X function. Interestingly, the data contains multiple labels for the genes, so it gets returned as a list containing matrices of each type. We mostly care about the protein_coding list, so we will build a Seurat object using just those genes, and filter out cells with fewer than 200 features, and genes with counts in fewer than 3 cells.
indir <- "/Users/smgroves/Dropbox (VU Basic Sciences)/Quaranta_Lab/SCLC_data/scRNAseq/2022_pipseq_organoids/20220714_Quaranta_8332_pipseq/"
# raw_counts<-read.table(file=paste0(indir,"8332_TO_1/raw_matrix/matrix.mtx"),sep=",")
#
# head(raw_counts)
expression_matrix <- Read10X(data.dir = paste0(indir,"8332_TO_1/filtered_matrix/sensitivity_1"))
10X data contains more than one type and is being returned as a list containing matrices of each type.
data <- CreateSeuratObject(counts = expression_matrix$protein_coding, project = "TO_1", min.cells = 3, min.features = 200)
data
An object of class Seurat
15778 features across 2085 samples within 1 assay
Active assay: RNA (15778 features, 0 variable features)
expr_2d <-Read10X(data.dir = paste0(indir,"8332_TO_2/filtered_matrix/sensitivity_1"))
10X data contains more than one type and is being returned as a list containing matrices of each type.
data2d <- CreateSeuratObject(counts = expr_2d$protein_coding, project = "TO_2", min.cells = 3, min.features = 200)
data2d
An object of class Seurat
16290 features across 4236 samples within 1 assay
Active assay: RNA (16290 features, 0 variable features)
It looked like this filtered the protein-coding genes from 20024 to 15778 and did not filter cells (which makes sense, since we are using filtered data already from PIPseeker.) For the 2D data, the data is reduced from 20024 x 4236 to 16290 x 4236.
We can check at this point how many counts we find for each of the four TFs (ANPY).
[1] "ASCL1 counts:"
[1] 0
[1] "NEUROD1 counts:"
[1] 0
[1] "POU2F3 counts:"
[1] 8
[1] "YAP1 counts:"
[1] 829
We find that ASCL1 and NEUROD1 are dropped out of this dataset or are unexpressed.
First we’ll do QC. This was already done with PIPseeker, so we probably don’t need to filter anything else.
# Visualize QC metrics as a violin plot
VlnPlot(data, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
FeatureScatter(data, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
We will normalize the data (log-transform), find the highly variable features, and then scale the data. The Seurat object retains the normalized data in data[["RNA"]]@data, and will keep the scaled data in data[["RNA"]]@scale.data.
data <- NormalizeData(data, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
We can plot POU2F3 and YAP1 after normalization. These plots will be more interesting later, when we can look at different clusters in the data.
VlnPlot(data, features = c("POU2F3", "YAP1"))
all.genes <- rownames(data)
data <- ScaleData(data, features = all.genes)
Centering and scaling data matrix
|
| | 0%
|
|========= | 6%
|
|================== | 12%
|
|=========================== | 19%
|
|==================================== | 25%
|
|============================================ | 31%
|
|===================================================== | 38%
|
|============================================================== | 44%
|
|======================================================================= | 50%
|
|================================================================================ | 56%
|
|========================================================================================= | 62%
|
|================================================================================================== | 69%
|
|========================================================================================================== | 75%
|
|=================================================================================================================== | 81%
|
|============================================================================================================================ | 88%
|
|===================================================================================================================================== | 94%
|
|==============================================================================================================================================| 100%
After running a PCA, look at the genes driving the variance in the top dimensions.
data <- RunPCA(data, features = VariableFeatures(object = data))
PC_ 1
Positive: HMGB2, UBE2C, H2AZ1, PIMREG, BIRC5, PBK, TUBA1B, RRM2, TOP2A, KPNA2
SNRPG, CKS1B, CCNB1, RAN, MAD2L1, PTTG1, CDCA3, CDKN3, PCNA, TYMS
CENPM, TUBA1C, NDC80, MCM7, SPC25, HMGB1, KIF2C, CKS2, UBE2T, NUF2
Negative: RPL32, RPS27, RPL36A, H2AC19, RPL17, H2AC18, RPS3A, H1-2, RPL17-C18orf32, RPS29
IFI27L2, H2AC6, H2AC8, H2BC5, ISG15, TMEM88, EIF4A2, ICAM4, CYSTM1, APOE
CDKN2C, LGALS1, MKNK2, VIM, PIFO, HOATZ, TMSB10, H2AW, TUBA1A, MTERF2
PC_ 2
Positive: CALD1, CAPG, DHRS2, COMMD6, BEX1, CCN2, MARCKS, CAV1, ALDH7A1, ARG2
ID4, ACAT2, MOXD1, TIMM8B, RPL23, SELENOW, SGK1, UACA, CCN1, CAV2
CRABP1, PROCR, GSTM3, EDN1, HDDC2, JPH4, SOX4, RBMS3, PIH1D1, SHISAL2B
Negative: APOC1, FTL, CYSTM1, HOATZ, RPS3A, ENSG00000287856, PBX3, MAP2, NQO1, FTH1
AHCYL1, BLVRB, SMARCA1, APOE, PRDX1, PLA2G12A, RPL32, TMEM88, ITGAV, HLA-B
TRIM9, GAR1, GASK1B, CLIP4, RPL17-C18orf32, MACF1, FNIP2, RPL17, TENM3, TXNRD1
PC_ 3
Positive: MT-CO3, MT-ND3, MT-CO2, RPS3A, MT-CYB, MT-ATP6, RPL32, PTTG1, RPL17, MT-ND4
MT-ND2, RPL17-C18orf32, MT-CO1, CDKN3, HMGN2, MT-ND1, SNRPG, PRDX1, STMN1, RPL36A
CKS1B, RPS7, RPS27, TUBA1B, HMGB2, CCNB2, PSMA3, CHCHD2, CDC20, CCNB1
Negative: SLC38A2, CCN2, CALD1, ENC1, ITM2B, RAPGEF2, CELF2, HMGA2, TENM3, CCN1
ACTG1, YAP1, CDK6, ERBB4, CAV1, DHRS2, LHFPL6, SCD, BEX1, CAV2
VIM, EPB41L2, ATP9A, ID4, NRG1, ZFP36L1, CAPG, ENAH, APBB2, UACA
PC_ 4
Positive: TXN, FTL, PRDX1, SLIRP, MYL12B, FTH1, COX7A2, HSP90AB1, PPIA, HSPB1
HSP90AA1, SOD1, CHCHD2, HSPE1-MOB4, ACTB, HSPE1, TXNDC17, CYCS, HSPD1, ENO1
SNRPG, TMSB4X, GSTP1, SUN3, PSMA7, EIF5A, CCT5, TIMM8B, DBI, OSGIN1
Negative: RPS27, RPL36A, UBE2C, TOP2A, HMGB2, NDC80, CDK1, RPL32, KIF2C, AURKA
RPS3A, IFI27L2, CDCA8, RPS29, RACGAP1, RPL17-C18orf32, SGO2, CCNA2, CKS2, ASPM
SPC25, ENSG00000273269, RPL17, ARL6IP1, NEK2, HJURP, KNSTRN, HMMR, NUF2, PSRC1
PC_ 5
Positive: PCNA, H4C3, DHFR, GINS2, TK1, RFC2, MCM4, CENPU, RPS4Y1, MCM6
RFC4, RAD51AP1, MCM3, CENPK, CDC6, GMNN, RRM2, MCM7, RPL17, ACYP1
MCM2, MCM10, RIBC2, RPS3A, DUT, RFC5, RPA2, CLSPN, DSN1, DTL
Negative: CCNB1, KIF20A, PTTG1, CDC20, PLK1, H2AC6, H2BC5, ARL6IP1, ASPM, AURKA
SOX4, CCNB2, KPNA2, PDP1, H2AW, TPX2, DLGAP5, MORF4L2, CDKN3, NEK2
BAMBI, HNRNPA2B1, PIF1, CCNA2, CENPF, RAD21, NUCKS1, KNSTRN, CKS2, EMP1
VizDimLoadings(data, dims = 1:2, reduction = "pca")
DimPlot(data, reduction = "pca")
DimHeatmap(data, dims = 1:5, cells = 500, balanced = TRUE)
Using two methods, we’ll evaluate the best number of top PCs to keep for further analysis. None of the data will be lost, but our analyses like UMAP/tSNE below will only use the top N PCs for computational efficiency. Usually we keep between 10-30.
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
data <- JackStraw(data, num.replicate = 100)
data <- ScoreJackStraw(data, dims = 1:20)
JackStrawPlot(data, dims = 1:15)
ElbowPlot(data)
All of the top 15 PCs are well above the dotted line (normal distribution), and there is potentially an elbow in the elbow plot between 6-9 PCs. We’ll keep 10 just to be safe.
More information about the clustering method (Louvain) can be found in the Seurat tutorial. When you make a new DimPlot, the cells will automatically be colored by these new labels.
data <- FindNeighbors(data, dims = 1:10)
data <- FindClusters(data, resolution = 0.5)
DimPlot(data, reduction = "pca")
data <- RunUMAP(data, dims = 1:10)
DimPlot(data, reduction = "umap")
We can now plot our features of interest on the UMAP:
Let’s also save the output of this notebook so far:
saveRDS(data, file = "./data/TO_1.rds")
data.markers <- FindAllMarkers(data, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
data.markers %>%
group_by(cluster) %>%
slice_max(n = 4, order_by = avg_log2FC)
NA
data.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(data, features = top10$gene) + NoLegend()
saveRDS(data, file = "./data/TO_1.rds")
saveRDS(data2d, file = "./data/TO_2.rds")
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.1.1 SeuratObject_4.0.4 Seurat_4.0.5
[4] dplyr_1.0.7
loaded via a namespace (and not attached):
[1] Rtsne_0.15 colorspace_2.0-2 deldir_1.0-6
[4] ellipsis_0.3.2 ggridges_0.5.3 rstudioapi_0.13
[7] spatstat.data_2.1-0 leiden_0.3.9 listenv_0.8.0
[10] farver_2.1.0 ggrepel_0.9.1 RSpectra_0.16-0
[13] fansi_1.0.2 codetools_0.2-18 splines_4.1.1
[16] knitr_1.36 polyclip_1.10-0 jsonlite_1.7.3
[19] ica_1.0-2 cluster_2.1.2 png_0.1-7
[22] uwot_0.1.11 shiny_1.7.1 sctransform_0.3.2
[25] spatstat.sparse_2.0-0 compiler_4.1.1 httr_1.4.2
[28] assertthat_0.2.1 Matrix_1.3-4 fastmap_1.1.0
[31] lazyeval_0.2.2 cli_3.1.1 limma_3.48.3
[34] later_1.3.0 htmltools_0.5.2 tools_4.1.1
[37] igraph_1.2.11 gtable_0.3.0 glue_1.6.1
[40] RANN_2.6.1 reshape2_1.4.4 Rcpp_1.0.8
[43] scattermore_0.7 jquerylib_0.1.4 vctrs_0.3.8
[46] nlme_3.1-153 lmtest_0.9-39 xfun_0.28
[49] stringr_1.4.0 globals_0.14.0 mime_0.12
[52] miniUI_0.1.1.1 lifecycle_1.0.1 irlba_2.3.5
[55] goftest_1.2-3 future_1.23.0 MASS_7.3-54
[58] zoo_1.8-9 scales_1.1.1 spatstat.core_2.3-2
[61] promises_1.2.0.1 spatstat.utils_2.2-0 parallel_4.1.1
[64] RColorBrewer_1.1-2 yaml_2.2.1 reticulate_1.23
[67] pbapply_1.5-0 gridExtra_2.3 ggplot2_3.3.5
[70] sass_0.4.0 rpart_4.1-15 stringi_1.7.6
[73] rlang_0.4.12 pkgconfig_2.0.3 matrixStats_0.61.0
[76] evaluate_0.14 lattice_0.20-45 ROCR_1.0-11
[79] purrr_0.3.4 tensor_1.5 htmlwidgets_1.5.4
[82] labeling_0.4.2 cowplot_1.1.1 tidyselect_1.1.1
[85] parallelly_1.30.0 RcppAnnoy_0.0.19 plyr_1.8.6
[88] magrittr_2.0.1 R6_2.5.1 generics_0.1.1
[91] DBI_1.1.1 pillar_1.6.4 withr_2.4.3
[94] mgcv_1.8-38 fitdistrplus_1.1-6 survival_3.2-13
[97] abind_1.4-5 tibble_3.1.6 future.apply_1.8.1
[100] crayon_1.4.2 KernSmooth_2.23-20 utf8_1.2.2
[103] spatstat.geom_2.3-1 plotly_4.10.0 rmarkdown_2.11
[106] grid_4.1.1 data.table_1.14.2 digest_0.6.29
[109] xtable_1.8-4 tidyr_1.1.4 httpuv_1.6.3
[112] munsell_0.5.0 viridisLite_0.4.0 bslib_0.3.1